This is a patch/re-dump of the "Ten Billion" text-only 4Chan thread archive, which is a dump of 10.8 million threads/162 million posts posted from 2005 to 2008 and scraped by an anonymous source (packaged in 2009 and uploaded to archive.org in 2018).
The original upload had some issues that prevented it from being fully read. This upload takes the file chanarchive.tar.gz (probably no relation to 4chanarchive/chanarchive) from the original upload (a tar of MyISAM database files), patches the corruption that garbles the posts table past a certain point, and presents the data in a format that can be ingested by newer MySQL versions (5.7 or 8; the original used 5.0.51). The XML and HTML formats in the original upload are subsets of the data in chanarchive.tar.gz (e.g. the XML version is missing /b/ and /p/; the HTML version is missing timestamps and post numbers, and its /b/ is corrupt; both are missing /con/).
Unpack and import into MySQL 5.7 or 8 in the usual way. The imported DB will have a single posts table and a single threads table. You can write a join between them to match posts to their thread numbers, source websites (a handful are from outside 4Chan) and boards.
Alternatively, to replicate this dump from the original dump on Linux...
- Download the original chanarchive.tar.gz. cd into the download directory.
- Patch the file with 'printf '\x6e' | dd conv=notrunc of=chanarchive.tar.gz bs=1 seek=$((0x8cee0ac3))'. This overwrites, in place, the single corrupt byte (one bad bit) at offset 0x8cee0ac3 that was causing the problem.
- Verify the fix by running 'gzip -tv chanarchive.tar.gz'. This checks that the uncompressed tar file size and CRC are correct.
- Unpack chanarchive.tar.gz into the database directory of a MySQL 5.7 installation, then run mysqldump on the resulting 'chanarchive' database.
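The patch-and-verify steps above can be tried safely on a throwaway file first. The following is only a sketch under stated assumptions: the file demo.gz, the scratch directory, and the choice of which byte to damage are all made up for illustration and have nothing to do with the real archive. It damages one byte of a small gzip file, confirms that 'gzip -t' rejects it, then writes the good byte back with the same printf | dd conv=notrunc technique and confirms that the test passes. It uses octal escapes in printf, which behave the same in bash and in plainer shells.

```shell
# Scratch demonstration only: demo.gz is a hypothetical stand-in, not the real archive.
work=$(mktemp -d)
printf 'sample data standing in for the real tarball\n' | gzip > "$work/demo.gz"

# The gzip trailer is the last 8 bytes (CRC32, then uncompressed size),
# so size-8 is the offset of the first CRC byte.
size=$(wc -c < "$work/demo.gz")
offset=$((size - 8))

# Record the good byte as three octal digits, then derive a damaged value
# by flipping its lowest bit.
good=$(dd if="$work/demo.gz" bs=1 skip="$offset" count=1 2>/dev/null | od -An -to1 | tr -d ' \n')
bad=$(printf '%03o' $(( 0$good ^ 1 )))

# Damage the file in place. conv=notrunc keeps dd from truncating the file.
printf "\\$bad" | dd conv=notrunc of="$work/demo.gz" bs=1 seek="$offset" 2>/dev/null
if gzip -t "$work/demo.gz" 2>/dev/null; then before=ok; else before=corrupt; fi

# Patch the single byte back, exactly as the real fix does at its own offset.
printf "\\$good" | dd conv=notrunc of="$work/demo.gz" bs=1 seek="$offset" 2>/dev/null
if gzip -t "$work/demo.gz" 2>/dev/null; then after=ok; else after=corrupt; fi

echo "before patch: $before, after patch: $after"
```

In the demo the damaged byte sits in the CRC field so that the failure is guaranteed; in the real file the bad byte is inside the compressed data, but the repair mechanics are the same.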
As of this upload, the 4Chan posts were search-indexed at old.sage.moe.